A Genomic Perspective on 
Protein Families 

Roman L. Tatusov, Eugene V. Koonin, David J. Lipman

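Comparison of proteins encoded in seven complete genomes from five major phylogenetic lineages and elucidation of consistent patterns of sequence similarities allowed the delineation of 720 clusters of orthologous groups (COGs). Each COG consists of individual orthologous proteins or orthologous sets of paralogs from at least three lineages. Orthologs typically have the same function, allowing transfer of functional information from one member to an entire COG. This relation automatically yields a number of functional predictions for poorly characterized genomes. The COGs comprise a framework for functional and evolutionary genome analysis.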



The release in 1995 of the complete genome sequence of the bacterium Haemophilus influenzae (1), followed by the genomes of several other bacteria (2), one archaeal genome (3), and one genome of a unicellular eukaryote (4), marked the advent of a new age in biology, in which complete genome sequences are becoming an indispensable component of our understanding of a variety of biological phenomena. The number of sequenced genomes is expected to grow exponentially for at least the next few years, and conceivably, their impact on biology will further increase (5).

Knowing the inventory of conserved genes responsible for housekeeping functions and understanding the differences between genomes are tantamount to understanding life itself, at least at the level of a single cell. Complete sequences are indispensable because they hold the only type of information that can be used to delineate the complete set of relationships between genes from different genomes. Furthermore, only with complete genome sequences is it possible to ascertain that a particular protein is not encoded in a given genome. Accordingly, an alternative protein for the respective function should be sought among functionally unassigned gene products (6). With multiple genome sequences, it is possible to delineate protein families that are highly conserved in one domain of life but are missing in the others. This knowledge may be critically important: For example, protein families that are conserved among bacteria but are missing in eukaryotes comprise the pool of potential targets for broad-spectrum antibiotics.
The knowledge of all of the gene sequences from multiple complete genomes changes the problem of gene classification. It becomes feasible to replace the more or less arbitrary clustering of genes by similarity with a complete, consistent system in which the groups are likely to have evolved from a single ancestral gene. Such a natural classification of genes will provide a framework for evolutionary studies and for rapid, largely automatic functional annotation of newly sequenced genomes. The system will evolve and improve with increasing coverage of the diversity of life forms with complete genome sequences. It is critical to have this system in place while the number of completed genomes is still small and each family can be explored individually. Here we describe a prototype of a natural system of gene families from complete genomes.

Orthologs and Paralogs: Deriving
Clusters of Orthologous Groups



The relationships between genes from different genomes are naturally represented as a system of homologous families that include both orthologs and paralogs. Orthologs are genes in different species that evolved from a common ancestral gene by speciation; by contrast, paralogs are genes related by duplication within a genome (7). Normally, orthologs retain the same function in the course of evolution, whereas paralogs evolve new functions, even if related to the original one. Thus, identification of orthologs is critical for reliable prediction of gene functions in newly sequenced genomes. It is equally important for phylogenetic analysis because interpretable phylogenetic trees generally can be constructed only within sets of orthologs (8). A complete list of orthologs also is a prerequisite for any meaningful comparison of genome organization (9).

www.sciencemag.org • SCIENCE • VOL. 278 • 24 OCTOBER 1997

A simple operational definition would maintain that for a given gene from one genome, the gene from another genome with the highest sequence similarity is the ortholog. Given the complete genome sequences, this straightforward approach often gives credible results, especially when the compared species are not too distant phylogenetically (9). At larger phylogenetic distances, however, the situation becomes more complicated. If gene duplications occurred in each of the given two clades subsequent to their divergence, only a many-to-many relationship will adequately describe orthologs, and accordingly, detection of the highest similarity will not result in the identification of the complete set of orthologs. In addition, when the best hit is not highly significant statistically, which is common in the case of phylogenetically distant relationships (10), it simply may be spurious. On the other hand, methods that apply a restrictive similarity cutoff are likely to result in a number of orthologs being missed.
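The best-hit definition discussed above can be illustrated with a short Python sketch. It is only a minimal illustration: the score table, the protein names (`a1`, `b1`, `b2`), and the helper `best_hits` are hypothetical and are not data or software from the analysis.

```python
from collections import defaultdict

def best_hits(scores, genome_of):
    """Return {query: {other_genome: best_subject}} from pairwise scores.

    scores: dict mapping (query, subject) -> similarity score (higher is better).
    genome_of: dict mapping protein name -> genome label.
    """
    best = defaultdict(dict)  # query -> genome -> (subject, score)
    for (query, subject), score in scores.items():
        g_query, g_subject = genome_of[query], genome_of[subject]
        if g_query == g_subject:
            continue  # same-genome hits cannot be orthologs
        current = best[query].get(g_subject)
        if current is None or score > current[1]:
            best[query][g_subject] = (subject, score)
    return {q: {g: s for g, (s, _) in hits.items()} for q, hits in best.items()}

# Hypothetical toy data: one gene in genome A, a duplicated gene in genome B.
genome_of = {"a1": "A", "b1": "B", "b2": "B"}
scores = {
    ("a1", "b1"): 310.0, ("a1", "b2"): 120.0,
    ("b1", "a1"): 305.0, ("b2", "a1"): 118.0,
}
bets = best_hits(scores, genome_of)
```

In this toy case `b2` points to `a1`, but `a1` points to `b1`, so a single best hit per gene does not recover both members of the duplicated pair, which is the limitation the text describes.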

Given the existence of one-to-many and many-to-many orthologous relationships, we redefined the task of identifying orthologs as the delineation of clusters of orthologous groups (COGs). Each COG consists of individual orthologous genes or orthologous groups of paralogs from three or more phylogenetic lineages. In other words, any two proteins from different lineages that belong to the same COG are orthologs. Each COG is assumed to have evolved from an individual ancestral gene through a series of speciation and duplication events.

In order to delineate the COGs, all pairwise sequence comparisons among the 17,967 proteins encoded in the seven complete genomes were performed (11), and for each protein, the best hit (BeT) in each of the other genomes was detected. The identification of COGs was based on consistent patterns in the graph of BeTs. The simplest and most important of such patterns is a triangle, which typically consists of orthologs. If a gene from one of the compared genomes has BeTs in two other genomes, it is highly unlikely that the respective genes are also BeTs for one another unless they are bona fide orthologs (12). The consistency between BeTs resulting in triangles does not depend on the absolute level of similarity between the compared proteins and thus allows the detection of orthologs among both slowly and quickly evolving genes. This approach is most likely to be informative when the BeTs forming a triangle come from widely different lineages. Accordingly, only five major, phylogenetically distant clades were used as independent contributors to COGs: Gram-negative bacteria (Escherichia coli and H. influenzae), Gram-positive bacteria (Mycoplasma genitalium and M. pneumoniae), Cyanobacteria (Synechocystis sp.), Archaea (Euryarchaeota) (Methanococcus jannaschii), and Eukarya (Fungi) (Saccharomyces cerevisiae) (13).
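The triangle criterion can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the function `find_triangles`, the protein names, and the BeT table are hypothetical, and the brute-force enumeration of all triples is chosen for clarity rather than speed.

```python
from itertools import combinations

def find_triangles(bet, lineage_of):
    """Find triangles of best hits (BeTs) spanning three different lineages.

    bet: dict mapping protein -> set of proteins that are its BeTs.
    lineage_of: dict mapping protein -> lineage (clade) label.
    Two proteins are linked if either one is a BeT of the other,
    so asymmetrical BeTs also count as sides of a triangle.
    """
    def linked(x, y):
        return y in bet.get(x, set()) or x in bet.get(y, set())

    triangles = set()
    for trio in combinations(sorted(lineage_of), 3):
        if len({lineage_of[p] for p in trio}) < 3:
            continue  # all three proteins must come from different lineages
        if all(linked(x, y) for x, y in combinations(trio, 2)):
            triangles.add(frozenset(trio))
    return triangles

# Hypothetical example: three mutually linked proteins plus one outlier.
lineage_of = {"ecoA": "Gram-negative", "synA": "Cyanobacteria",
              "yeaA": "Eukarya", "mjaB": "Archaea"}
bet = {
    "ecoA": {"synA", "yeaA"},
    "synA": {"ecoA", "yeaA"},
    "yeaA": {"ecoA", "synA"},
    "mjaB": {"ecoA"},
}
triangles = find_triangles(bet, lineage_of)
```

Only the three mutually linked proteins form a triangle; `mjaB` has a BeT in one other lineage only and is left out.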

The procedure used to derive COGs included finding all triangles formed by BeTs between the five major clades and merging those triangles that had a common side until no new ones could be joined. A triangle is an elementary, minimal COG (Fig. 1A). The groups produced by merging adjacent triangles include orthologs from different lineages and, in many cases, paralogs from the same lineage (Fig. 1, B and C). Because of the existence of paralogs, the BeTs that form the triangles are not necessarily symmetrical: For example, in the COG shown in Fig. 1C, the same M. genitalium protein, MG249, is the BeT for four paralogous σ subunits of E. coli RNA polymerase, but only for one of them, RpoD, is the relationship symmetrical.
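The merging step can be sketched with a union-find structure over triangles, joining any two triangles that share a side (a pair of proteins). This is an illustrative sketch only; the function name `merge_triangles` and the protein labels are hypothetical.

```python
from itertools import combinations

def merge_triangles(triangles):
    """Merge triangles sharing a common side; return a list of protein sets."""
    triangles = [frozenset(t) for t in triangles]
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_owner = {}  # side (pair of proteins) -> first triangle seen with it
    for i, tri in enumerate(triangles):
        for side in combinations(sorted(tri), 2):
            j = first_owner.setdefault(side, i)
            parent[find(i)] = find(j)  # no-op when the side is new (j == i)

    groups = {}
    for i, tri in enumerate(triangles):
        groups.setdefault(find(i), set()).update(tri)
    return sorted(groups.values(), key=sorted)

# Hypothetical triangles: the first two share the side (g1, m1).
triangles = [
    {"e1", "g1", "m1"},
    {"c1", "g1", "m1"},
    {"e2", "c2", "y2"},
]
groups = merge_triangles(triangles)
```

The first two triangles collapse into one four-member group, while the third remains a minimal COG on its own.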

Most of the clusters derived by the above procedure meet the definition of a COG, that is, all of the proteins from the different lineages in the same cluster are likely to be orthologs. There are, however, several reasons why, in certain cases, COGs may be lumped together. Proteins may contain two or more distinct regions, each of which belongs to a different conserved family; usually such proteins are loosely referred to as multidomain (14). Each of the clusters was inspected for the presence of multidomain proteins, individual domains were isolated (15), and a second iteration of the sequence comparison was performed with the resulting database of domains. Some of the COGs may include proteins from different lineages that are paralogs rather than orthologs, primarily because of differential gene loss in the major phylogenetic lineages. When one gene in a pair of paralogs is lost in one lineage but not in the others, two COGs that should have been distinct may be artificially joined. Therefore, the level of sequence similarity between the members of each cluster was analyzed, and clusters that seemed to contain two or more COGs were split.

Fig. 1. Examples of COGs. Solid lines show symmetrical BeTs. Broken lines show asymmetrical BeTs, with color corresponding to the species for which the BeT is observed. Genes from the same species are adjacent; otherwise the gene names are positioned arbitrarily. A unique COG ID is indicated in the upper left corner. (A) Congruent BeTs form a triangle, the minimal COG. Origin of the proteins: KatG, E. coli; sll1987, Synechocystis sp.; and YKR066c, S. cerevisiae. Note that all the BeTs are symmetrical. (B) A simple COG with two [...] H. influenzae; MG345, M. genitalium; MJ^22, M. pneumoniae; MJ0947, M. jannaschii; and YBL076c and YPL040c, S. cerevisiae. Note the adjacent triangles with a common side, for example, IleS-MG345-MJ0947 and sll1362-MG345-MJ0947. YPL040c is the yeast mitochondrial isoleucyl-tRNA synthetase; the bacterial orthologs and that from M. jannaschii are the BeTs for this yeast protein, but the reverse is true only of the bacterial proteins (symmetrical BeTs). Conversely, for YBL076c, which is the yeast cytoplasmic isoleucyl-tRNA synthetase, the M. jannaschii ortholog is a symmetrical BeT, whereas the bacterial BeTs are asymmetrical. (C) A complex COG with multiple paralogs. Origin of the proteins: RpoH, RpoS, RpoD, and FliA, E. coli; HIN1403 and HIN1655, H. influenzae; MG249, M. genitalium; MP4B5, M. pneumoniae; sll0184, sll0306, slr0653, sll1689, sll2012, and slr1564, Synechocystis sp. RpoD, HIN1655, slr0653, and MG249 are major sigma factors (σ70), whose function is [...] conserved; note the fully symmetrical relationships between these proteins. The other proteins are specialized sigma factors whose radiation from the ancestral family apparently was accompanied by modification of the function and involved accelerated evolution; note the asymmetrical BeTs.

Phylogenetic and Functional
Patterns in COGs

The described analysis resulted in 710 apparent COGs. This set appears to be essentially complete as far as orthologous relationships are concerned. Indeed, when the portion of the database of proteins from complete genomes not included in the COGs was clustered by sequence similarity (16), only 10 groups were identified, which, upon careful inspection of the alignments, were considered likely to constitute additional COGs missed originally. These groups were incorporated, producing the final collection of 720 COGs, including 6814 proteins and distinct domains of multidomain proteins (6646 distinct gene products, or 37% of the total number of genes in the seven complete genomes) (17).

Most of the COGs are relatively small groups of proteins. One-third of the COGs (240 COGs with 1406 proteins) contain one representative of each of the included species (no paralogs), and 192 more COGs include paralogs from only one species, most frequently yeast (87 COGs). The mean number of proteins per COG increases with increasing number of genes in a genome, from 1.2 for M. genitalium to 2.9 for yeast. A notable aspect of many COGs is the differential behavior of paralogs. It is typical that one of the paralogs, for example, in yeast, shows consistently higher similarity to the orthologs in all or most of the other species (Fig. 1, B and C). For numerous yeast paralogs, particularly components of the translation apparatus, the underlying cause is obvious: the gene whose product is most similar to the bacterial orthologs is of mitochondrial origin (Fig. 1B). A more common explanation for the asymmetry of the relationships in the COGs, however, is that the highly conserved paralog has retained the original function, whereas the functions of the less conserved paralogs have changed in the course of evolution. In the already considered example (Fig. 1C), the symmetrical component of the graph (solid lines) delineates the conserved function of the σ70 subunit of the RNA polymerase (E. coli RpoD), which is required for the transcription of the bulk of bacterial genes, whereas the asymmetrical BeTs (broken lines) are observed for σ subunits (E. coli RpoH, RpoS, and FliA) involved in the expression of specialized gene subsets (18). This phenomenon appears to be widespread, as we found 549 proteins in 302 COGs whose corresponding paralogs showed consistently lower similarity to other members of the COG. One may think of the rapidly evolving paralogs as progenitors of new families emerging from within the conserved ones. The COGs will be an important resource in a systematic survey of the functional diversification of paralogs in conserved gene families.

There are several large clusters in the current collection with complex relationships between members. Two of these, namely the adenosine triphosphatase (ATPase) components of ABC transporters and histidine kinases, each include over 100 members. It is likely that subsequent detailed analysis of these large groups (for example, by phylogenetic tree methods) will result in their split into several distinct COGs, especially when more genomes are available. On a more general note, COGs do not supplant traditional methods of phylogenetic analysis but rather provide the appropriate starting material for these methods, in particular for a systematic analysis of phylogenetic tree topology.

Figure 2 shows the breakdown of the COGs by broadly defined function (19) and by species (20). For the majority of the COGs, the protein function is either known from direct experiments, mainly in E. coli or yeast, or can be confidently inferred on the basis of significant sequence similarity to functionally characterized proteins from other species. It has to be emphasized that construction of the COGs includes automatic prediction of the function for numerous genes, particularly from the poorly characterized genomes such as M. jannaschii. There is, however, a substantial fraction of the COGs (14%) for which only general functional prediction, typically of biochemical activity, but not the actual cellular role, could be made, and for another 5%, there was no functional clue (Fig. 3). Each of the COGs includes proteins from at least three major clades whose divergence time is estimated to be over a billion years (21), that is, they all are ancient, conserved families with important, if not necessarily essential, cellular functions. Therefore, the proteins belonging to the "mysterious" COGs are good candidates for directed experimental studies.

The distribution of proteins from different species in the COGs shows several trends (Fig. 2), although the bias in the current collection of complete genomes (in particular, because three lineages are required to form a COG, all COGs had to have a bacterial member) must be taken into account when interpreting these comparisons. The fraction of proteins belonging to COGs is greatest in the nearly minimal genomes of mycoplasmas (70% for M. genitalium) and much lower in the larger genomes of E. coli and yeast (40% and 26%, respectively), which indeed is the tendency expected of conserved families presumably associated with cellular housekeeping functions. The genes of the pathogenic bacteria (H. influenzae and two mycoplasmas) are essentially subsets of the two larger bacterial gene complements, E. coli and Synechocystis sp. The latter two species almost always co-occur in the COGs. The main cause of the observed congruency is likely to be the conservation of the core of ancestral bacterial genes in nonparasitic species from different major clades. Accordingly, the fact that proteins from the pathogenic bacteria are missing in many COGs most likely testifies to gene loss, which has been extensive even in this subset of highly conserved genes. The co-occurrence of M. jannaschii in a COG with E. coli or Synechocystis is measurably more frequent than that with yeast (Fig. 2). Such a distribution of the archaeal genes appears to be due primarily to the blending of bacterial-like and eukaryotic-like genes in the archaeal genomes (10), although the mentioned bias in the genome collection is also a factor.

The phylogenetic distribution of the COG members is distinct for different functional classes (Fig. 2). It is not unexpected that translation is the only category in which ubiquitous COGs are predominant. Another obvious trend is the absence of proteins from pathogenic bacteria (H. influenzae and, particularly, the mycoplasmas) in many COGs in each functional category other than translation and transcription, but especially in the metabolic functional classes. Conversely, the congruence between the two nonparasitic bacteria, E. coli and Synechocystis sp., holds for all functional classes (Fig. 2). Also apparent is the differential appearance of archaeal proteins that tend to group with yeast proteins in the translation and transcription classes (which, given the bias in the genome collection, results in ubiquitous COGs) but in all other functional classes are frequently found in COGs with bacterial proteins only.

Fig. 2. Breakdown of the COGs by functional category and by species. Each column shows a COG [...]. Functional categories shown include replication, repair, and recombination; molecular chaperones; outer membrane and cell wall biogenesis; energy production and conversion; carbohydrate metabolism and transport; and amino acid metabolism and transport. The letters in the leftmost field encode the functional categories (used in the COG IDs).

The phylogenetic distribution of COG membership can be conveniently presented in terms of "phylogenetic patterns," which show the presence or absence of each analyzed species (Fig. 3). Of the 88 patterns that include at least three lineages (the definition of a COG), 36 were actually found. Missing were mostly patterns with only one of the two species of Mycoplasma, which was predictable because the gene complement of M. genitalium is essentially a subset of the M. pneumoniae complement (22). The remaining eight patterns that were never observed all include pathogenic bacteria without E. coli, which is the largest and most diverse of the available bacterial genomes. The two most abundant patterns could easily be predicted: all species ("ehgpcmy"), and all species except for the mycoplasmas ("eh__cmy"). What appears much less trivial is that these patterns together encompass only one-third of all COGs. This fact emphasizes the remarkable fluidity of genomes in evolution, revealed in spite of the fact that the analysis concentrated on ancient conserved families. Multiple solutions for the same important cellular function appear to be a rule rather than an exception, at least when phylogenetically distant species are considered (10, 23). On the other hand, the eight most frequent patterns, which together account for 85% of the COGs, all include both E. coli and Synechocystis, emphasizing the congruency between these genomes.
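The count of 88 possible patterns can be checked by brute-force enumeration. The sketch below assumes the seven-letter code implied by the patterns and protein lists in this article (e, E. coli; h, H. influenzae; g, M. genitalium; p, M. pneumoniae; c, Synechocystis sp.; m, M. jannaschii; y, yeast) and the grouping of species into the five clades named earlier; the function names are hypothetical.

```python
from itertools import product

SPECIES = "ehgpcmy"  # letter order used in the phylogenetic patterns
LINEAGE = {"e": "Gram-negative", "h": "Gram-negative",
           "g": "Gram-positive", "p": "Gram-positive",
           "c": "Cyanobacteria", "m": "Archaea", "y": "Eukarya"}

def pattern(present):
    """Encode a set of species letters as a pattern such as 'eh__cmy'."""
    return "".join(s if s in present else "_" for s in SPECIES)

def lineages(pat):
    """Return the set of lineages represented in a pattern string."""
    return {LINEAGE[s] for s in pat if s != "_"}

# Enumerate all 2**7 presence/absence combinations of the seven species.
all_patterns = [pattern({s for s, keep in zip(SPECIES, mask) if keep})
                for mask in product((False, True), repeat=len(SPECIES))]

# Keep only patterns that satisfy the COG definition (three or more lineages).
valid = [p for p in all_patterns if len(lineages(p)) >= 3]
print(len(valid))  # 88
```

The enumeration reproduces the figure of 88 quoted in the text, which supports the assumed letter-to-clade assignment.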



The 114 ubiquitous COGs, most of them including components of the translation and transcription machinery, form the universal core of life. This set is more than twofold down from the bacterial "minimal set" consisting of 256 genes (23), but significant further erosion seems unlikely, given the broad spectrum of compared genomes. The higher order distribution of the COGs by the three domains of life, with only 45% of the COGs including representatives of Bacteria, Archaea, and Eukarya, is another manifestation of the dynamics of gene families in evolution (Fig. 3). The picture is expected to become even more complex, and the fraction of three-domain COGs will probably drop, once archaeal-only, eukaryotic-only, and archaeal-eukaryotic COGs emerge with the accumulation of genome sequences.
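The domain-level view can be derived directly from pattern strings. The following sketch assumes the same seven-letter code as the patterns in the text, with m standing for the single archaeon and y for yeast; the list of patterns is a hypothetical example and does not reproduce the real distribution.

```python
DOMAIN = {"e": "Bacteria", "h": "Bacteria", "g": "Bacteria", "p": "Bacteria",
          "c": "Bacteria", "m": "Archaea", "y": "Eukarya"}

def domains(pat):
    """Return the set of domains of life represented in a pattern string."""
    return {DOMAIN[ch] for ch in pat if ch in DOMAIN}

def fraction_three_domain(patterns):
    """Fraction of patterns that include Bacteria, Archaea, and Eukarya."""
    hits = sum(1 for p in patterns if len(domains(p)) == 3)
    return hits / len(patterns)

# Hypothetical sample of four COG patterns (one entry per COG).
sample = ["ehgpcmy", "eh__cmy", "__gpc_y", "e___cm_"]
share = fraction_three_domain(sample)
```

In this toy sample two of the four patterns span all three domains, so `share` is 0.5; applied to one pattern per COG in the full collection, the same calculation would give the 45% figure quoted above.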

The unusual, rare patterns are of particular interest, suggesting the possibility of unexpected findings. Each of the COGs with patterns that occur only once in our current collection (Table 1) should correspond to a unique function scattered over disconnected branches of the tree of life. Why such functions are conserved and are presumably important for survival in some but not other lineages is a challenge to be addressed experimentally. The principal evolutionary mechanisms that can be invoked to explain the emergence of these rare patterns are differential gene loss and horizontal transfer of genes. Some of the functions involved, for example, lipoate-protein ligase and glycyl-transfer RNA (tRNA) synthetase, appear to be strictly essential, but in different species, they are performed by two distinct sets of orthologs unrelated to one another (24). Other functions, for example, thymidine phosphorylase and hexuronate dehydrogenases, may be dispensable under most conditions, and accordingly differential gene loss is likely; it is remarkable, however, that these functions are preserved in the nearly minimal gene complements of the mycoplasmas. Two of the unique patterns, namely "__gpc_y" and "_hgp__y," might have evolved through horizontal transfer of typical eukaryotic genes into bacterial genomes. The latter pattern is of particular interest as it involves the choline kinase gene common to a number of bacterial pathogens and implicated in pathogenicity (25). Two of the COGs with unique patterns, "_h__c_y" and "e_gp_my," include highly conserved but uncharacterized proteins whose functions could be predicted only by detailed analysis of conserved protein motifs (Table 1). These examples demonstrate the potential for protein function prediction inherent in the construction of the COGs themselves.

The sampling of genomes we compared is small and biased, and when a more complete set is available, the distribution of COGs by phylogenetic patterns is likely to change significantly; for example, many patterns that are currently rare may become common when larger genomes from the Gram-positive bacterial lineage (such as Bacillus subtilis) become available. Nevertheless, we believe that the language of phylogenetic patterns will become even more useful for the description of relationships between multiple genomes.

Connecting and 
Expanding the COGs 



Fig. 3. Phylogenetic patterns in COGs, grouped by the domains of life represented (panel labels include Bacteria-Eukarya and Bacteria-Archaea). Letter codes are as in Fig. 2; [...] absence of the respective species.



Ancient families of paralogs that span a broad range of taxa are well known (26). Accordingly, a number of COGs are related to each other and can be connected into superfamilies. In order to elucidate the superfamily structure of the COG collection, we used the recently developed PSI-BLAST (position-specific iterative BLAST) program, which combines BLAST search with profile analysis (27). Two COGs were considered connected if at least two of the proteins from the first COG hit members of the second COG in the PSI-BLAST search and vice versa. Clustering by this criterion produced 58 superfamilies including 280 COGs.
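The connection criterion can be sketched as a graph problem. The code below is an illustration under stated assumptions: the hit table stands in for parsed PSI-BLAST output, the clustering is taken to be single-linkage (connected components), which the text does not specify, and all names are hypothetical.

```python
def connect_cogs(hits, cog_of, min_proteins=2):
    """Cluster COGs into superfamilies by reciprocal profile-search hits.

    hits: dict mapping query protein -> set of proteins it hit.
    cog_of: dict mapping protein -> COG identifier.
    Two COGs are linked if at least `min_proteins` proteins of each
    hit members of the other. Only COGs with at least one link are returned.
    """
    # For each ordered COG pair (a, b), collect the proteins of a that hit b.
    hitters = {}
    for query, subjects in hits.items():
        a = cog_of[query]
        for b in {cog_of[s] for s in subjects if s in cog_of} - {a}:
            hitters.setdefault((a, b), set()).add(query)

    links = {}
    for (a, b), queries in hitters.items():
        reverse = hitters.get((b, a), set())
        if len(queries) >= min_proteins and len(reverse) >= min_proteins:
            links.setdefault(a, set()).add(b)
            links.setdefault(b, set()).add(a)

    # Assumed single-linkage clustering: connected components of the link graph.
    seen, superfamilies = set(), []
    for start in sorted(links):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            cog = stack.pop()
            if cog not in component:
                component.add(cog)
                stack.extend(links[cog] - component)
        seen |= component
        superfamilies.append(component)
    return superfamilies

# Hypothetical toy data: COG_A and COG_B hit each other with two proteins each;
# COG_C is reached by a single protein only and stays unconnected.
cog_of = {"a1": "COG_A", "a2": "COG_A", "b1": "COG_B", "b2": "COG_B", "c1": "COG_C"}
hits = {"a1": {"b1"}, "a2": {"b2"}, "b1": {"a1"}, "b2": {"a1", "c1"}, "c1": {"b2"}}
superfamilies = connect_cogs(hits, cog_of)
```

Requiring two hitting proteins in each direction filters out links that rest on a single, possibly spurious, hit.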

Compared to COGs themselves, the superfamilies are a higher level of protein classification. Typically, they include conserved motifs that are determinants of a distinct biochemical activity, which, however, may be required for a variety of cellular functions. For example, the largest superfamily contains 53 COGs with 863 proteins, all of which contain conserved motifs typical of ATPases and GTPases but are involved in a broad range of processes from DNA replication to metabolite transport (28).

Superfamilies and their signature motifs will be useful in classifying proteins that have evolved to an extent that they cannot be assigned to any COG but still retain a conserved motif. We sought to detect such proteins with distant, subtle similarity to COGs that might be encoded in the analyzed genomes. The PSI-BLAST analysis (27) detected "tails" of distantly related proteins (a total of 3686) for 321 COGs, increasing the total number of proteins connected to COGs to 10,332 (58% of the entire protein set from complete genomes).
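The totals quoted here follow from figures given elsewhere in the article, as the short check below shows. It assumes that the denominator is the 17,967 proteins encoded in the seven genomes and that the 3686 "tail" proteins do not overlap with the 6646 gene products already in COGs.

```python
TOTAL_PROTEINS = 17_967        # proteins encoded in the seven complete genomes
GENE_PRODUCTS_IN_COGS = 6_646  # distinct gene products assigned to the 720 COGs
TAIL_PROTEINS = 3_686          # distantly related proteins found by PSI-BLAST

connected = GENE_PRODUCTS_IN_COGS + TAIL_PROTEINS
in_cogs_percent = round(100 * GENE_PRODUCTS_IN_COGS / TOTAL_PROTEINS)
connected_percent = round(100 * connected / TOTAL_PROTEINS)
print(connected, in_cogs_percent, connected_percent)  # 10332 37 58
```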

Because apparent orthologs from at least three major clades were required to form a COG, there are potential new COGs hidden among the results of the comparison of protein sequences from complete genomes (11). Clustering by sequence similarity the proteins not included in COGs (16) resulted in 443 groups with members from two clades. Predictably, the greatest number, 204, were from the cyanobacterial and Gram-negative clades, followed by 67 groups combining yeast and M. jannaschii. Many of these groups are likely to become COGs once additional genomes are included in the analysis.

Prediction of Protein Functions
With the COG System

The COG system allows automatic functional and phylogenetic annotation of genes and gene sets (29). As in the procedure used for the construction of the COGs, the criterion for adding likely orthologs from other genomes to the COGs is based on the consistency between the observed relationships. A protein is compared to the database of protein sequences from complete genomes (11) and is included in a COG if at least two BeTs fall into it. Given that the COGs were constructed from proteins encoded in complete genomes, it is not a requirement that newly included proteins also originate from a complete genome. Indeed, while the unsequenced portion of a genome may encode proteins with the highest similarity to those included in COGs, the BeTs will not change for the products of already sequenced genes.
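The "two BeT" rule for adding a new protein to an existing COG can be sketched as follows. This is a minimal illustration with hypothetical names; how ties between COGs are resolved is not stated in the text, so the sketch simply takes the COG that collects the most BeTs.

```python
from collections import Counter

def assign_to_cog(bets, cog_of, min_bets=2):
    """Return the COG into which at least `min_bets` best hits fall, else None.

    bets: iterable of the query's best hits, one per compared genome.
    cog_of: dict mapping protein -> COG identifier (absent if not in a COG).
    """
    counts = Counter(cog_of[b] for b in bets if b in cog_of)
    if not counts:
        return None
    cog, n = counts.most_common(1)[0]  # assumption: the best-supported COG wins
    return cog if n >= min_bets else None

# Hypothetical COG membership and two query proteins.
cog_of = {"e1": "COG1", "c1": "COG1", "y1": "COG2"}
accepted = assign_to_cog(["e1", "c1", "y1"], cog_of)  # two BeTs fall into COG1
rejected = assign_to_cog(["e1", "y1"], cog_of)        # one BeT per COG only
```

Because the rule depends only on the query's own best hits in the complete genomes, it can be applied to a protein from a partially sequenced genome, as the text notes.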

As a demonstration of the principle coupled with additional characterization of the COGs themselves, the sequences of proteins with known three-dimensional structures from the PDB database (30) were compared to the protein sequences encoded in complete genomes. The "two BeT" procedure resulted in proteins with known three-dimensional structure being included in 183 COGs, of which one was shown to be a false positive by subsequent alignment analysis. Thus, structural information could be inferred for at least 25% of the COGs. In most cases, the structurally characterized protein (from E. coli or yeast) actually belongs to a COG or is a closely related homolog of the proteins forming a COG.

Some of the predictions, however, provide significant functional and structural inferences. Of particular interest are (i) the possibility of modeling the nuclease domain of polyadenylate cleavage factors (31) with the beta-lactamase structure, (ii) the presence of an acylphosphatase domain in hydrogenase expression factors, which form a highly conserved COG, and in a number of uncharacterized proteins, and (iii) the connection between a unique carbonic anhydrase and an acetyltransferase family (Table 2).

Table 1. COGs with unique phylogenetic patterns. Species codes in the patterns are as in Fig. 3; each COG ID includes a letter indicating the functional category.

Pattern and COG ID | Proteins | Activity or function | Comment
e_gp_m_, COG0213F | DeoA-MG051-MP090-MJ0667 | Thymidine phosphorylase; salvage of deoxypyrimidines | Nonessential gene in E. coli; apparent orthologs found in other Gram-positive bacteria and in humans (35).
e__p__y, COG0246G | MtlD, UxaB, UxuB, YdfI, YeiQ-MP190-YEL070w, YNR073c | Mannitol-1-phosphate and other hexuronate dehydrogenases; hexuronate catabolism | Nonessential genes in E. coli; accessory reactions of carbohydrate metabolism (36).
e_gp__y, COG0095H | LplA-MG270-MP450-(sll0809)-YJL046w | Lipoate-protein ligase A; ligation of lipoate to apoproteins of pyruvate dehydrogenase and other lipoate-dependent enzymes | There are two unrelated classes of lipoate-protein ligases. E. coli and yeast encode both forms; H. influenzae and Synechocystis sp. encode the B form (included in a separate COG); sll0809 is a distant homolog of the A form (37), which was not automatically included in the COG but was detected with PSI-BLAST.
eh_pc_y, COG0604R | AdhC + 1B E. coli proteins-MP27S-sll0990, slr1192-YBR046c + 19 yeast proteins | Alcohol dehydrogenase class III and related Fe-S dehydrogenases; various catabolic pathways | Highly conserved protein family distinct from other Fe-S oxidoreductases.
_h__c_y, COG057BR | HIN1B93_1-slt1B21-YLR109w | Glutaredoxin-like membrane protein (prediction) | The H. influenzae protein contains an additional thioredoxin-like domain.
__gpc_y, COG0631R | MG108-MP586-sll1771, sll1033, sll0602-YDL006w + 6 yeast proteins | Protein serine and threonine phosphatase | Serine and threonine protein phosphatases are abundant in eukaryotes but not in bacteria (38).
__gp_my, COG0423J | MG251-MP4B3-MJ0228-YPR081c, YBR121c | Glycyl-tRNA synthetase (eukaryotic and Gram-positive type) | Gram-negative bacteria and Synechocystis encode a distinct glycyl-tRNA synthetase that appears to be unrelated to the eukaryotic and Gram-positive type; the closest relative of this COG in E. coli and H. influenzae is prolyl-tRNA synthetase (24).
e_gp_my, COG0622R | b2300-MG207, MP029-MJ0623, MJ0935-YHR012w | Phosphoesterase (prediction) | Highly conserved protein family that shares only modified catalytic motifs (detected by PSI-BLAST; P = 0.004) with other phosphoesterases, including protein phosphatases.
eh_pcmy, COG0078E | ArgI, ArgF, YgeW-HIN0012-MP531-sll0902-MJ0881-YJL088w | Ornithine carbamoyltransferase; arginine biosynthesis | Amino acid metabolism appears to be completely missing in M. genitalium, but residual reactions may occur in M. pneumoniae.
_hgp__y, COG0510M* | HIN093B-MG356, MP310-YDR147w, YLR133w | Choline kinase (prediction) involved in lipopolysaccharide biosynthesis | Enzyme common to several bacterial pathogens and eukaryotes; contributes to pathogenicity (25).

*This COG was added to the collection by cluster analysis.

Probably the most important application of the COGs is functional characterization of newly sequenced genomes. In the preliminary analysis of the recently published genome of the major human bacterial pathogen Helicobacter pylori (32), 813 proteins (51% of the gene products) from this bacterium were included in 453 preexisting COGs and 143 new COGs (33). In spite of the fact that many H. pylori proteins are highly similar to homologs from E. coli and other bacteria and have been explored in detail (32), this analysis produced over 100 additional functional predictions (33).

Conclusions and Perspective 

The COGs bring together the fields of comparative genomics and protein classification. Among the numerous possible approaches to protein classification, the COGs appear to be unique as a prototype of a natural system, which has as its basic unit a group of descendants of a single ancestral gene. Typically, such a group is associated with a conserved, specific function, so that the inclusion of a protein in a COG automatically entails functional prediction.

Each COG contains conserved genes from at least three
phylogenetically distant clades and, accordingly, corresponds to an
ancient conserved region (ACR). Previous analyses have indicated
that the total number of distinct ACRs is likely to be less than
1000 (34). Thus, even with the limited number of complete genomes
currently available for analysis, the COGs have already captured a
substantial fraction of all existing highly conserved protein
domains. With more genomes included in the system, the discovery of
additional COGs should gradually level off, with the great majority
of the ACRs encoded in the added genomes fitting into already known
COGs.
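
The three-clade requirement reflects the way COGs are assembled from triangles of genome-specific best hits (BeTs; notes 11 and 12). A minimal sketch of triangle detection is given below, following the configuration described in note 12: B is the BeT for A, C is the BeT for B, and C is also the BeT for A. The protein names, the `best_hit` table, and the function names are hypothetical. The grouping of paralogs and the multiple-alignment check described in note 11 are omitted, so this is a simplified illustration rather than the published procedure.

```python
from itertools import permutations

# Hypothetical BeT table: best_hit[(protein, target_genome)] -> best hit.
best_hit = {
    ("A1", "b"): "B1", ("A1", "c"): "C1",
    ("B1", "a"): "A1", ("B1", "c"): "C1",
    ("C1", "a"): "A1", ("C1", "b"): "B1",
    ("A2", "b"): "B1", ("A2", "c"): "C2",
}
# Hypothetical assignment of each protein to its genome.
genome_of = {"A1": "a", "A2": "a", "B1": "b", "C1": "c", "C2": "c"}


def find_triangles(best_hit, genome_of):
    """Return sets {A, B, C} where B = BeT(A), C = BeT(B) and C = BeT(A)."""
    triangles = set()
    genomes = sorted(set(genome_of.values()))
    for ga, gb, gc in permutations(genomes, 3):
        for a, g in genome_of.items():
            if g != ga:
                continue
            b = best_hit.get((a, gb))
            if b is None:
                continue
            c = best_hit.get((b, gc))
            # The triangle closes only if A's own best hit in genome c is C.
            if c is not None and c == best_hit.get((a, gc)):
                triangles.add(frozenset((a, b, c)))
    return triangles


def chance_closure_probability(n_genes_in_c):
    """Zero-approximation estimate from note 12: about 1/N."""
    return 1.0 / n_genes_in_c


print(find_triangles(best_hit, genome_of))
print(chance_closure_probability(1000))  # hypothetical genome of 1000 genes
```

In this toy table, A2 does not join a triangle because its best hit in genome c (C2) differs from the best hit of B1 in genome c (C1). The small chance-closure probability is what makes a closed triangle informative.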

With the forthcoming flood of genome
sequences, a coherent framework for understanding
these genomes from both the functional
and evolutionary viewpoints is a
must. We regard the current collection of



Table 2. Structural and functional predictions for uncharacterized proteins in COGs.

For each COG, the entry gives the phylogenetic pattern and COG ID*;
the proteins in the COG†; the activity and function; the homolog in
PDB‡, with the number of BeTs detected and the lowest P with a COG
member; and a comment.

COG0595R (phylogenetic pattern e_gpcmy)
  Proteins in COG: PhnP, ElaC-2g-2p-5c-Bm-YLR277C, YMR137C, YKR079C
  Activity and function: Predicted Zn-dependent hydrolases
  Homolog in PDB: Beta-lactamase (1BMC)
  BeTs detected (no.): 2
  Lowest P with a COG member: 0.039
  Comment: Activity is not known for any protein in this ubiquitous
    COG. Biochemical and genetic data indicate that YLR277c is
    involved in messenger RNA 3'-end processing (31), whereas
    YMR137c is the DNA cross-link repair protein SNM1 (39). A motif
    including the Zn-coordinating histidines of beta-lactamase is
    conserved.

COG0607R (phylogenetic pattern eh__cmy)
  Proteins in COG: SseA, PspE, GlpE, YibN, YbbB, YnjE,
    YgaP-2h-5c-MJ0052-4y
  Activity and function: Predicted sulfurtransferases
  Homolog in PDB: Rhodanese (1RHD, 2ORA, 1ORB)
  BeTs detected (no.): 2
  Comment: The sulfurtransferase activity of SseA has been
    demonstrated (40), but the rest of the proteins in this COG have
    no known activity. PspE (phage shock protein), GlpE
    (uncharacterized protein involved in glycerol metabolism), and
    other small proteins correspond to one of the two rhodanese
    domains.

COG0596R
  Proteins in COG: PldB, MhpC, YcdJ,
    YnbC-HIN0065-MG020-MP132-6c-YNR054C, YKL094W
  Activity and function: Predicted hydrolases and acyltransferases
  Homolog in PDB: Lipases (2LIP, 1TAH, 1CVL)
  BeTs detected (no.): 3
  Lowest P with a COG member: [illegible]
  Comment: PldB is known to possess triglyceride lipase activity
    (41). All other proteins in the COG have not been characterized
    but now can be predicted to possess the α/β-hydrolase fold.

COG0068C (phylogenetic pattern e___cm_)
  Proteins in COG: HypF-sll0322-MJ0713
  Activity and function: Hydrogenase maturation factor
  Homolog in PDB: Acylphosphatase (1APS)
  BeTs detected (no.): 2
  Lowest P with a COG member: 2 × 10^-[exponent illegible]
  Comment: HypF is required for hydrogenase biosynthesis (42), but
    no biochemical activity is known. The ~100 amino acid,
    NH2-terminal domain aligns with acylphosphatase, with the
    catalytic residues conserved, suggesting that HypF orthologs
    indeed possess acylphosphatase activity. A PSI-BLAST search with
    this domain as the query detected five additional likely
    acylphosphatases, namely E. coli YccX and M. jannaschii MJ0809,
    MJ0553, MJ1331, and MJ1405 (43).

COG0663R (phylogenetic pattern e___cm_)
  Proteins in COG: CaiE, YrdA, YdbZ-sll1636, sll1031-MJ0304
  Activity and function: Predicted carbonic anhydrases
  Homolog in PDB: Carbonic anhydrase from Methanosarcina
    thermophila (1THJ)
  BeTs detected (no.): 3
  Comment: The biochemical activity of the proteins in this COG is
    not known. They show not only conservation of histidine residues
    comprising the active center of this unusual carbonic anhydrase
    (44) but also significant similarity to acetyltransferases of
    the isoleucine patch superfamily (45), suggesting an unexpected
    connection between the two types of enzymes.

*The designations are as in Table 1 and Fig. 3.
†2g indicates two proteins from M. genitalium, 2p indicates two
proteins from M. pneumoniae, and so forth.
‡The PDB accession is indicated in parentheses.









COGs as a crude first version of such a
framework. Inclusion of additional, phylogenetically
diverse genomes and further development
of the procedures used to derive
and analyze COGs will hopefully result in
refinement of this system, making it a solid
platform for genome annotation and evolutionary
genomics.

REFERENCES AND NOTES 



1. R. D. Fleischmann et al., Science 269, 496 (1995).

2. C. M. Fraser et al., ibid. 270, 397 (1995); R. Himmelreich
et al., Nucleic Acids Res. 24, 4420 (1996); T. Kaneko et al.,
DNA Res. 3, 109 (1996); F. R. Blattner et al., Science 277,
1453 (1997).

3. C. J. Bult et al., Science 273, 1058 (1996).

4. A. Goffeau et al., ibid. 274, 546 (1996); H. W. Mewes et al.,
Nature 387, 7 (1997).

5. C. R. Woese, Curr. Biol. 6, 1060 (1996); G. J. Olsen
and C. R. Woese, Cell 89, 991 (1997); E. V. Koonin,
Genome Res. 7, 418 (1997).

6. E. V. Koonin, A. R. Mushegian, K. E. Rudd, Curr. Biol.
6, 404 (1996); E. V. Koonin and A. R. Mushegian,
Curr. Opin. Genet. Dev. 6, 757 (1996).

7. W. M. Fitch, Syst. Zool. 19, 99 (1970). This definition
may not embrace all of the complexity of relationships
between genes in different genomes. For example,
if genes A and B are paralogs encoded in
genome 1, and A' and B' are their respective orthologs
in genome 2, what is the appropriate description
of the relationship between A and B'? They
formally are not paralogs, even though a generalized
definition might include such cases. Furthermore,
one-to-many and many-to-many orthologous relationships
evidently exist.

8. W. M. Fitch, Philos. Trans. R. Soc. London Ser. B
349, 93 (1995).

9. R. L. Tatusov et al., Curr. Biol. 6, 279 (1996).

10. E. V. Koonin, A. R. Mushegian, M. Y. Galperin, D. R.
Walker, Mol. Microbiol. 25, 619 (1997).

11. The protein sequences were from the original references
(1-4), with modifications (for example, tentative
correction of frame-shift errors) and additions
(previously unreported predicted genes) made for
E. coli (E. V. Koonin and R. L. Tatusov, unpublished
observations; K. E. Rudd, personal communication),
H. influenzae (9), M. genitalium and M. jannaschii
(10), and S. cerevisiae (T. J. Wolfsberg and D.
Landsman, personal communication). The list of systematic
names for all E. coli genes was provided by
K. Rudd, and the names for all yeast genes were
provided by T. Wolfsberg and D. Landsman; the
H. influenzae genes were renamed as previously described
(9); the gene names for the other species
were from the original publications. The resulting
protein database from complete genomes used in all
comparisons contained 4283 sequences from
E. coli, 1703 sequences from H. influenzae, 468 sequences
from M. genitalium, 677 sequences from
M. pneumoniae, 3168 sequences from Synechocystis
sp., 1736 sequences from M. jannaschii, and
5932 sequences from S. cerevisiae, totaling 17,967
sequences. This sequence set is available on the
World Wide Web at http://www.ncbi.nlm.nih.gov/COG.
All pairwise comparisons between these sequences
were performed using the BLASTPGP program,
which is based on an enhanced version of the
BLAST algorithm and includes analysis of local alignments
with gaps (27). Predicted coiled coil regions in
protein sequences were masked before the comparison
using the batch version of the COILS2 program
[A. Lupas, Methods Enzymol. 266, 513 (1996); D. R.
Walker and E. V. Koonin, ISMB 5, 333 (1997)], and
additionally, regions of low complexity were masked
using the SEG program with default parameters
[J. C. Wootton and S. Federhen, Methods Enzymol.
266, 554 (1996)]. Before the detection of triangles of
BeTs, paralogs were identified as those proteins
from the same lineage that showed greater similarity
to each other than to any protein from another lineage.
For the purpose of triangle formation, paralogs
were treated as a group. The algorithm further included
verification that the BeTs included in a triangle
formed a consistent multiple alignment; triangles that
did not contain a conserved motif were disregarded.

12. Although the exact solution depends on the amino
acid composition and size of the particular proteins,
under zero approximation, if B (from genome b) is the
BeT for A (from genome a), and C (from genome c) is
the BeT for B, the probability that C is the BeT for A
by chance is close to 1/N, where N is the number of
genes in genome c, or ~0.001.

13. C. R. Woese, Microbiol. Rev. 51, 221 (1987);
G. J. Olsen, C. R. Woese, R. Overbeek, J. Bacteriol. 176,
1 (1994); N. R. Pace, Science 276, 734 (1997). A
BeT to a given clade was registered if detected in any
of the constituent species, for example, in E. coli or
H. influenzae for the Gram-negative bacteria.

14. H. Watanabe and J. Otsuka, Comput. Appl. Biosci.
11, 159 (1995); E. V. Koonin, R. L. Tatusov, K. E.
Rudd, Methods Enzymol. 266, 295 (1996).

15. A schematic visual representation of the search results
was used for this analysis [T. L. Madden, R. L.
Tatusov, J. Zhang, Methods Enzymol. 266, 131
(1996)].

16. A single-linkage clustering procedure was used with
random match probability, P < 0.001, as the cutoff.

17. A searchable database of COGs is available at
http://www.ncbi.nlm.nih.gov/COG. Each COG was assigned
a unique identification number, which includes
a letter for the functional category (19) and a
number (see examples in Fig. 1 and Tables 1 and 2).

18. M. Lonetto, M. Gribskov, C. A. Gross, J. Bacteriol.
174, 3843 (1992).

19. The broad functional categories of proteins were as
defined previously (9), except that transcription was
separated from replication, recombination, and repair.
This classification is a modification of the system
originally developed for E. coli proteins [M. Riley,
Microbiol. Rev. 57, 862 (1993)].

20. A partially similar representation of some of the protein
families from complete genomes has been recently
published [R. A. Clayton, O. White, K. A.
Ketchum, J. C. Venter, Nature 387, 459 (1997)].

21. R. F. Doolittle, D.-F. Feng, S. Tsang, G. Cho, E.
Little, Science 271, 470 (1996).



22. R. Himmelreich et al., Nucleic Acids Res. 25, 701
(1997).

23. A. R. Mushegian and E. V. Koonin, Proc. Natl. Acad.
Sci. U.S.A. 93, 10268 (1996).

24. E. V. Koonin, A. R. Mushegian, P. Bork, Trends Genet.
12, 334 (1996).

25. J. N. Weiser, M. Shchepetov, S. T. Chong, Infect.
Immun. 65, 943 (1997).

26. J. P. Gogarten et al., Proc. Natl. Acad. Sci. U.S.A.
86, 6661 (1989); N. Iwabe et al., ibid., p. 9355; J. P.
Gogarten, E. Hilario, L. Olendzenski, in Evolution of
Microbial Life, D. McL. Roberts, P. Sharp, G. Alderson,
M. Collins, Eds. (Cambridge Univ. Press, Cambridge,
1996), pp. 257-292.

27. S. F. Altschul et al., Nucleic Acids Res. 25, 3389
(1997). The probability of a random match, P <
0.001, was used in all PSI-BLAST searches.

28. J. E. Walker, M. Saraste, M. J. Runswick, N. J. Gay,
EMBO J. 1, 945 (1982); A. E. Gorbalenya and E. V.
Koonin, Nucleic Acids Res. 17, 8413 (1989); M.
Saraste, P. R. Sibbald, A. Wittinghofer, Trends Biochem.
Sci. 15, 430 (1990).

29. Protein sequences can be submitted for searching
against COGs at http://www.ncbi.nlm.nih.gov/COG/cognitor.html

30. F. C. Bernstein et al., J. Mol. Biol. 112, 535 (1977).

31. G. Chanfreau, S. M. Noble, C. Guthrie, Science 274,
1511 (1996); A. Jenny, L. Minvielle-Sebastia, P. J.
Preker, W. Keller, ibid. 274, 1514 (1996); G. Stumpf
and H. Domdey, ibid., p. 1517.

32. J.-F. Tomb et al., Nature 388, 539 (1997).

33. E. V. Koonin, R. L. Tatusov, M. Y. Galperin, M. N.
Rozanov, unpublished observations.

34. P. Green et al., Science 259, 1711 (1993).

35. J. Neuhard and R. A. Kelln, in Escherichia coli and
Salmonella: Cellular and Molecular Biology, F. C.
Neidhardt et al., Eds. (American Society for Microbiology,
Washington, DC, ed. 2, 1995), pp. 580-599.

36. E. C. C. Lin, ibid., pp. 307-342.

37. T. W. Morris, K. E. Reed, J. E. Cronan Jr., J. Bacteriol.
177, 1 (1995).

38. P. Bork, N. P. Brown, H. Hegyi, J. Schultz, Protein
Sci. 5, 1421 (1996).

39. D. Richter, E. Niegemann, M. Brendel, Mol. Gen.
Genet. 231, 194 (1992); R. Wolter, W. Siede, M.
Brendel, ibid. 250, 162 (1996).

40. H. Hama, T. Kayahara, W. Ogawa, M. Tsuda, T.
Tsuchiya, J. Biochem. 115, 1135 (1994).

41. T. Kobayashi et al., ibid. 98, 1017 (1985).

42. A. Colbeau et al., Mol. Microbiol. 8, 15 (1993).

43. M. N. Rozanov and E. V. Koonin, unpublished
observations.

44. B. E. Alber and J. G. Ferry, Proc. Natl. Acad. Sci.
U.S.A. 91, 6909 (1994); C. Kisker et al., EMBO J. 15,
2323 (1996).

45. E. V. Koonin, Protein Sci. 4, 1608 (1995); M. N. Rozanov
and E. V. Koonin, unpublished observations.

46. We thank A. Schaffer for modifying the PSI-BLAST
program; R. Walker, H. Watanabe, and M. Rozanov
for valuable help with data analysis; K. Rudd, T.
Wolfsberg, and D. Landsman for unpublished data;
and P. Bork, M. Galperin, M. Gelfand, A. Mushegian,
P. Pevzner, M. Roytberg, M. Rozanov, and R. Walker
for helpful discussions.









